Protein Engineering, Design and Selection — Latest Matching Preprints

1

BoltzGen: Toward Universal Binder Design

Stark, H.; Faltings, F.; Choi, M.; Xie, Y.; Hur, E.; O'Donnell, T. J.; Bushuiev, A.; Ucar, T.; Passaro, S.; Mao, W.; Reveiz, M.; Bushuiev, R.; Portnoi, T.; Pluskal, T.; Sivic, J.; Kreis, K.; Vahdat, A.; Ray, S.; Goldstein, J. T.; Savinov, A.; Hambalek, J. A.; Gupta, A.; Taquiri-Diaz, D. A.; Zhang, Y.; Snyder, S. J.; Hatstat, A. K.; Arada, A.; Kim, N. H.; Fan, H.; Tackie-Yarboi, E.; Boselli, D.; Schnaider, L.; Liu, C. C.; Li, G.-W.; Hnisz, D.; Sabatini, D. M.; DeGrado, W. F.; Wohlwend, J.; Corso, G.; Barzilay, R.; Jaakkola, T.

2026-06-16 bioengineering 10.1101/2025.11.20.689494 medRxiv

Top 0.1%

7.7%

We introduce BoltzGen, an all-atom generative model for designing proteins and peptides across all modalities to bind a wide range of biomolecular targets. BoltzGen builds strong structural reasoning capabilities about target-binder interactions into its generative design process. This is achieved by unifying design and structure prediction, resulting in a single model that also reaches state-of-the-art folding performance. BoltzGen's generation process can be controlled with a flexible design specification language over covalent bonds, structure constraints, binding sites, and more. We experimentally validate these capabilities in eight diverse design campaigns with functional and affinity readouts across 26 targets. In our experiments, binder modalities span from nanobodies to disulfide-bonded peptides, and targets from disordered proteins to small molecules. In particular, we identify nanobody binders for novel targets with low similarity to proteins with already known bound structures. We release model weights, data, and both inference and training code at: https://github.com/HannesStark/boltzgen.

2

Precision at Every Scale: Efficiency in AI-Driven De Novo Antibody Design

Cha, H.; Cho, K.; Gu, J.; Gwak, D.; Ham, S. W.; Hong, M.; Kim, S.; Kim, S.; Kwon, S.; Lee, C.; Lee, D. K.; Lee, D.; Lee, D.; Lim, J.; Noh, J.; Oh, S.; Park, E.; Park, S.; Park, T.; Ryu, E.; Ryu, S.; Sa, D. H.; Seok, C.; Sim, J.; Song, M. Y.; Won, J.; Woo, H.; Yang, J.

2026-05-15 bioengineering 10.1101/2025.11.21.689414 medRxiv

Top 0.1%

3.5%

The precise de novo design of antibodies remains a therapeutic challenge. The AI platform, GaluxDesign, was evaluated in a high-efficiency Precision-Scale Workflow by synthesizing and testing only 50 full-length IgG candidates per epitope across eight distinct epitopes from six therapeutic targets. This campaign yielded a 10.5% binder rate (estimated EC50 < 100 nM), identifying target-specific binders for seven of eight epitopes, with multiple candidates exhibiting sub-nanomolar to single-digit nanomolar dissociation constants (Kd). We further assessed the same workflow on nine shared benchmark targets selected for external comparison, where GaluxDesign identified target-specific binders for eight of nine targets, demonstrating strong target-level performance relative to previously reported de novo antibody design approaches. Together, these results establish a high-efficiency, precision-scale workflow for generating novel, high-affinity therapeutic antibodies.

3

Overestimating zero-shot fitness prediction: Broad benchmarks mask local failures and practical limitations

Woolley, P. R.; Feller, A.; Ellington, A. O.; Wilke, C. O.

2026-06-07 bioengineering 10.64898/2026.06.04.730121 medRxiv

Top 0.1%

3.2%

Deep learning models have emerged as promising tools for navigating mutational landscapes in protein engineering. These models can be used to predict mutation fitness without the need for task-specific training, a process known as zero-shot prediction. However, their practical utility remains only partially characterized. Here, we evaluate the zero-shot performance of a panel of protein sequence and structure models across a range of benchmarking conditions, focusing on factors that complicate the interpretation of aggregate metrics. We show that input modality (sequence vs. structure) does not dictate performance on phenotypic tasks. Instead, performance is sensitive to experimental variability and is heavily confounded by correlation between phenotype and protein abundance. While available models may act as coarse filters separating fit mutations from deleterious ones, they cannot meaningfully rank a set of fit mutations or prioritize new-to-nature functions. Ultimately, the practical utility of zero-shot prediction from protein models is narrower than aggregate benchmarks imply.

4

CCK* (Convex Closure K*): A Suite of Algorithms for De Novo L- and D-peptide Design

Childs, H.; McBride, A. C.; Donald, B. R.

2026-06-01 bioinformatics 10.1101/2025.11.21.689740 medRxiv

Top 0.1%

3.2%

The computational design of L-peptides and their mirror-image counterparts, D-peptides, is an active area in drug design. Peptide therapeutics offer exceptional structural diversity and high binding specificity, while D-peptides additionally confer critical advantages such as proteolytic resistance. Progress in de novo D-peptide design has been hindered by the absence of evolutionary context and limited structural data, both of which underpin the deep learning methods widely used in L-peptide design. Consequently, a robust framework capable of designing both L- and D-peptides should integrate data-driven inference with first-principles, physics-based modeling. Here, we introduce a unified computational framework that supports de novo design of both L- and D-peptides, thereby expanding the accessible design space across both chiral spaces. Convex Closure K* (CCK*) is a suite of chirality-agnostic algorithms: SCOPE, MONTAGE, and ARISE. SCOPE uses geometry as a proxy for chemical energetics, computing convex hull representations of rotameric states to rapidly generate multi-sequence protein contact maps. MONTAGE employs geometric hashing in conjunction with the K* algorithm to generate and rank backbone scaffolds according to their suitability for sequence design. ARISE is a K*-based sequence design algorithm that performs iterative residue assignment in an undirected graph to design high-affinity peptide sequences. We apply the full CCK* suite to six de novo design tasks, benchmarking chirality-preserving and chirality-inverting designs in both homochiral and heterochiral complexes.

5

AI-assisted improvement of Aspergillus oryzae β-galactosidase using an Ensemble of Protein Language Models

Trapote Fernandez, A.; Fernandez, A.; Mendez-Liter, J. A.; Prieto, A.; Barriuso, J.; Osorio, F. G.

2026-05-21 synthetic biology 10.64898/2026.05.20.726739 medRxiv

Top 0.1%

3.1%

β-galactosidases (BGs) are essential enzymes widely used in the food industry, particularly in the production of lactose-free products. Among them, the BG from Aspergillus oryzae is of industrial relevance due to its activity at acidic pH and moderate thermal tolerance. However, enhancing its catalytic performance remains a key challenge. Traditional enzyme engineering methods are time-consuming and resource-intensive, limiting their scalability. Recent advances in Artificial Intelligence (AI), particularly those based on Natural Language Processing, offer a promising alternative by enabling efficient exploration of protein sequence space and prediction of beneficial mutations. In this study, we introduce an ensemble-based, zero-shot Protein Language Model (PLM) pipeline that reconciles predictions from six independent models (ESM2 and the five ESM1v variants), combined with a diversity-aware candidate selection strategy. Applied to the BG from A. oryzae, this approach identified beneficial mutations leading to novel enzyme variants with up to a four-fold increase in catalytic efficiency on oNPGal, a two-fold increase on lactose, and, independently, a T338I variant with markedly enhanced thermostability (≈80% residual activity after 24 h at 60 °C), all without requiring supervised fine-tuning on experimental fitness data. Our results demonstrate that consensus across an ensemble of PLMs can efficiently enrich beneficial substitutions in industrially relevant enzymes and substantially reduce the number of wet-lab candidates that need to be screened.
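
The abstract above does not say how the six models' predictions are reconciled or how diversity is enforced. As a rough illustration only, the sketch below assumes a mean-rank consensus and a "one substitution per position" diversity rule; the function names, mutation labels, and scores are all invented for the example and are not taken from the preprint.

```python
from collections import defaultdict

def consensus_rank(scores_by_model):
    """Average each mutation's rank across models (rank 1 = highest score).

    Assumes every model scores the same set of mutations.
    """
    rank_sums = defaultdict(float)
    for scores in scores_by_model.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, mutation in enumerate(ordered, start=1):
            rank_sums[mutation] += rank
    n_models = len(scores_by_model)
    return {m: total / n_models for m, total in rank_sums.items()}

def position(mutation):
    """Extract the residue index from a label such as 'A10G'."""
    return int(mutation[1:-1])

def select_diverse(mean_ranks, k):
    """Pick up to k mutations by mean rank, at most one per position."""
    chosen, used_positions = [], set()
    for mutation in sorted(mean_ranks, key=lambda m: (mean_ranks[m], m)):
        if position(mutation) not in used_positions:
            chosen.append(mutation)
            used_positions.add(position(mutation))
        if len(chosen) == k:
            break
    return chosen

# Invented zero-shot scores (higher = predicted more favourable).
toy_scores = {
    "model_a": {"A10G": 2.1, "A10S": 1.9, "L55F": 0.5, "K72R": -1.0},
    "model_b": {"A10G": 1.5, "A10S": 1.7, "L55F": 0.9, "K72R": -0.2},
    "model_c": {"A10G": 2.4, "A10S": 0.3, "L55F": 1.0, "K72R": -0.5},
}
mean_ranks = consensus_rank(toy_scores)
picks = select_diverse(mean_ranks, k=2)
print(picks)
```

With these toy numbers the second-ranked substitution is skipped because it shares a position with the first, which is the kind of redundancy a diversity-aware selection step is meant to avoid.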

6

Objective curriculum-guided design of multi-property proteins

Liu, L.; Zhao, J.; Xie, X.; Xu, S.; Ren, M.; Zhang, X.; He, Z.; Liu, F.; Yu, C.; Wang, K.; Wang, X.; Liang, X.; Ye, X.; Bu, D.; Zhou, H.

2026-05-28 bioengineering 10.64898/2026.05.25.727596 medRxiv

Top 0.1%

2.4%

Designing functional proteins that simultaneously possess multiple biochemical properties remains a significant challenge, as key protein properties, such as solubility, stability, binding affinity, and chemical resistance, are often interdependent or even conflicting. Current approaches typically attempt to jointly optimize multiple functional objectives in one shot, followed by extensive screening to identify rare feasible designs. Here, we introduce OCDesign, an objective curriculum-guided framework for multi-property protein design. OCDesign is based on the objective curriculum principle, i.e., the order in which objectives are introduced can shape the accessibility of functional solutions. In each design round of OCDesign, candidate sequences are generated in silico, assessed across multiple properties, selected based on Pareto-optimal trade-offs, and experimentally validated, with each experimental stage testing the role of the newly introduced objective within the curriculum. Using antibody-binding protein A as a model system, we show that one-shot optimization fails to yield functional designs, whereas a staged curriculum (progressing from solubility and structural consistency to binding affinity, and then to alkaline resistance) enables the design of proteins possessing multiple desired properties through substantially fewer wet-lab experiments. These results establish OCDesign as a practical computational-experimental strategy for organizing and integrating multiple objectives in protein design, and suggest that objective ordering is a key determinant of accessibility in high-dimensional design spaces.
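
The Pareto-optimal selection step mentioned above can be sketched generically. The code below is not OCDesign itself: it assumes every property is scored so that higher is better, and it models a curriculum stage simply as the subset of objectives considered so far. Candidate names, scores, and the `active` parameter are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates, active):
    """Names of candidates that are non-dominated on the active objectives.

    candidates: dict mapping name -> tuple of scores (higher is better)
    active: indices of the objectives introduced so far in the curriculum
    """
    projected = {
        name: tuple(scores[i] for i in active)
        for name, scores in candidates.items()
    }
    return [
        name for name, scores in projected.items()
        if not any(
            dominates(other, scores)
            for other_name, other in projected.items()
            if other_name != name
        )
    ]

# Invented scores: (solubility, structural consistency, binding affinity).
toy = {
    "seq_1": (0.9, 0.8, 0.1),
    "seq_2": (0.6, 0.9, 0.7),
    "seq_3": (0.5, 0.5, 0.9),
    "seq_4": (0.4, 0.4, 0.3),
}
stage_1 = pareto_front(toy, active=[0, 1])     # solubility + consistency only
stage_2 = pareto_front(toy, active=[0, 1, 2])  # binding affinity added
print(stage_1, stage_2)
```

In a real staged campaign, later stages would presumably score newly generated designs rather than re-filter the same pool; the sketch only shows how the set of non-dominated candidates changes as objectives are added.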

7

Minimal Data, Maximal Insight (MDMI): A Structure-guided Pipeline for Discovering Functional Alternatives in Peptide-Protein Interfaces

Bayat, P.; Perkins, S. J.; Clancy, S.; Patel, S. S.; Yin, R. F.; Bozovicar, K.; Singh, S.; Shrestha, S.; Moustafa, Z.; Zayani, R.; IWE, I.; Bayat, S.; Kelly, P.; Vigar, J. R. J.; White, V. Y.; Xie, M.; Simchi, M.; Palter, S.; Nguyen, J.; Zeisler, I. Y.; Wu, B.; Pardee, K.

2026-07-14 synthetic biology 10.64898/2026.07.13.737974 medRxiv

Top 0.1%

2.3%

Discovering functional peptides across vast sequence space remains a formidable challenge, particularly when experimental training data is scarce. We present Minimal Data Maximal Insight (MDMI), a two-stage structure-guided computational pipeline that designs functional peptide variants using only a small, annotated dataset. Rather than relying on sequence information alone, MDMI integrates three-dimensional structural features derived from predicted peptide-protein complexes into a machine learning model that captures interface geometry and binding energetics. This structure-aware predictor, paired with a genetic algorithm for sequence exploration, reduced false positives from 70% to close to zero in an all-negative benchmark panel compared with a sequence-only model in computational benchmarking, and produced approximately four-fold more high-confidence in silico binders than state-of-the-art peptide/protein design baselines. Using the split-GFP system as a testbed, where fluorescence provides a direct functional readout of peptide-protein complementation, MDMI identified peptides with up to 38% sequence divergence from wild-type in Stage 1 while retaining measurable activity. In Stage 2, motif-guided recombination of successful Stage 1 variants produced highly divergent yet functional peptides bearing over 50% sequence difference from wild-type, revealing two distinct functional clusters in sequence space. As further validation, a top-performing candidate expressed as a full-length GFP fusion retained a GFP-like emission profile, supporting formation of a fluorescent GFP-like scaffold. These results demonstrate that structure-informed pipelines can uncover remote functional sequence space from minimal data, with broad implications for peptide and therapeutic analog discovery.

8

Scalable Production of a De Novo SARS-CoV-2 Antiviral miniprotein in Escherichia coli

Shin, J.; Kim, E.-m.; Jang, J.-h.; Jee, S.-w.; Kim, S.-h.; Yu, S.; Yoon, M.; Craig, D.; Swoyer, R.; Alamuri, P.; Price, A.; Patel, S.; Ravichandran, R.; Carter, L.; Pallerla, S.

2026-06-24 bioengineering 10.64898/2026.06.23.734092 medRxiv

Top 0.1%

2.1%

The rapid emergence of SARS-CoV-2 variants that evade neutralizing antibodies underscores the need for next-generation antiviral biologics that combine molecular precision with scalable, cost-effective manufacturing. Computationally designed miniproteins targeting the receptor-binding domain (RBD) of the spike protein offer a compelling alternative to monoclonal antibodies due to their small size, high thermal stability, and compatibility with microbial expression systems. Here we report the end-to-end development and cGMP production of IPD-52520, a de novo antiviral miniprotein, using an optimized E. coli platform. Two miniprotein candidates, a homotrimeric construct ("Trimer", IPD-52520, 17 kDa) and a tandem fusion ("Daisy", IPD-52521, 25 kDa), were evaluated in parallel through systematic optimization of strain selection, media composition, fed-batch fermentation, inclusion-body solubilization, refolding, and chromatographic purification. The Trimer was downselected as the lead molecule based on superior preclinical efficacy, favorable pharmacokinetic properties, and higher volumetric manufacturing yields. The optimized process delivers approximately 2 g/L of purified protein at greater than 90% purity. Scale-up from 5 L to 50 L under cGMP conditions demonstrated excellent batch-to-batch reproducibility across six independent batches, supporting nonclinical and Phase 1 clinical supply. Comprehensive biophysical characterization confirmed a well-folded, predominantly alpha-helical trimer (Tm = 73.4 °C; polydispersity = 1.005) with an intact primary structure and strong target-binding affinity (KD < 1 pM). Real-time stability studies indicate that the drug substance is stable at 2-8 °C for at least 12 months, with stability studies ongoing. These results demonstrate the feasibility of translating computationally designed antiviral miniproteins into manufacturable biologics and provide a platform applicable to rapid-response therapeutics against current and future pandemic threats.

9

PolyFold: Evaluation of Open-Use Molecular Structure Prediction Algorithms to Inform Their Utility in Diverse Biological Applications

Stephenson, H.; Voicu, D.; Novakov, V.; Levy, M.; Marsilio, J.

2026-06-16 bioengineering 10.64898/2026.06.16.732304 medRxiv

Top 0.1%

2.1%

With the growing use of machine-learning-assisted pipelines for designing, characterizing, and optimizing biomolecules, the reliability of structure prediction models is increasingly important. PolyFold is a benchmarking framework developed to evaluate open-use structure prediction models, Boltz-2 and OpenFold 3, as commercially accessible alternatives to AlphaFold 3. We outline an end-to-end workflow automation tool to streamline input file creation, batch automation, and comprehensive analysis of model outputs for leading open-use structure prediction models. We curated an evaluation dataset of several thousand high-quality Protein Data Bank structures, homology-filtering against the training sets of both models to ensure a fair analysis. We then implemented an evaluation pipeline incorporating structural metrics (RMSD, TM-score, lDDT, etc.), interface metrics (DockQ, ilDDT, iRMSD, etc.), and physicochemical realism checks (based on bond lengths, angles, molecular internal energies, etc.). We identify key performance disparities, observing that Boltz-2 is generally superior to OpenFold 3, though the differential is partially attributable to residual homology leakage not accounted for by prevailing test set curation practices. We thus recommend a new method for reducing homology when building a test set: length-weighted average fractional identity cutoffs rather than lowest chain fractional identity cutoffs. Even after eliminating residual leakage, Boltz-2 still performs better on full-set comparisons and a variety of important partitions (nucleic acids, protein-ligands, Ab-Ags, etc.). Both models are strong at folding monomeric structures, though both struggle with homomultimer placement and small molecule physical realism, demonstrating enduring limitations of machine learning methods. This work is the first end-to-end, open-use, and reproducible platform for systematically assessing state-of-the-art structure prediction models. PolyFold enables practitioners to determine how models compare in performance on specific inference tasks and supports the broader adoption of accessible computational tools to facilitate biomolecular science.
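
To make the difference between the two test-set filtering criteria concrete, here is a small sketch. The abstract does not define either quantity precisely, so the definitions below are assumptions: each chain carries its highest fractional identity to any training-set chain, the "lowest chain" criterion takes the minimum over chains, and the length-weighted criterion averages identities weighted by chain length. The cutoff and the example complex are invented.

```python
def lowest_chain_identity(chains):
    """Minimum per-chain identity to the training set (assumed definition)."""
    return min(identity for _, identity in chains)

def length_weighted_identity(chains):
    """Chain-length-weighted mean identity to the training set (assumed definition)."""
    total_length = sum(length for length, _ in chains)
    return sum(length * identity for length, identity in chains) / total_length

def passes(value, cutoff):
    """A complex is kept in the test set only if its identity is below the cutoff."""
    return value < cutoff

# Hypothetical complex: a 400-residue chain nearly identical to training data,
# bound to a 20-residue peptide with little similarity to anything seen.
complex_chains = [(400, 0.95), (20, 0.10)]  # (length, max identity to training set)
cutoff = 0.40  # illustrative threshold, not taken from the preprint

lowest = lowest_chain_identity(complex_chains)
weighted = length_weighted_identity(complex_chains)
print(passes(lowest, cutoff), passes(weighted, cutoff))
```

Under these assumptions the complex is kept by the lowest-chain rule because of its short novel peptide, but rejected by the length-weighted rule because most of its residues closely match training data, which is the kind of residual leakage the abstract describes.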

10

Prediction-Guided Design of a More Developable FGF21 Construct

Bozkurt, C.; Nathanail, E.; Goteti, A.

2026-07-14 bioengineering 10.64898/2026.07.13.738140 medRxiv

Top 0.1%

1.8%

For structural-biology and protein-production pipelines, the hardest part of a difficult protein is not the biology; it is obtaining a well-behaved sample for functional studies. Programs routinely stall at construct design, expression, and purification: deciding where to truncate, which tags to use, how to express, and how to purify so the protein survives concentration and handling. These decisions are still made largely by literature precedent and experimental experience, and they require trial-and-error before arriving at a functional construct for hard targets. We present a prospective, single-pair wet-lab case study testing whether an integrated computational platform can improve these decisions. For human fibroblast growth factor 21 (FGF21), a clinically important and stability-challenged metabolic hormone, we compared two expression constructs produced side by side under the same experimental workflow, using two different design strategies: one designed by a scientist from the literature (reproducing the published core-domain construct, PDB 6M6E), and one designed by the Orbion platform, an AI, prediction-guided protein-design system (orbion.life), which additionally generated the expression and purification protocols (executed scientist-in-the-loop). The platform's construct used an unconventional, longer C-terminal boundary not found in public sequence databases. Since the two constructs differ in more than one feature, we treat them as workflow-level designs throughout. The scientist construct gave a higher initial yield (~2.4× more protein recovered at affinity capture). The platform-designed construct, however, showed a more favourable downstream developability profile: it concentrated higher (1.4 vs 0.7 mg/mL) while remaining more monodisperse by dynamic light scattering (DLS). The scientist construct, in contrast, aggregated on concentration, so its initial-yield advantage did not survive: in the final concentrated sample the Orbion construct provided the more usable material for downstream studies. Computed for the mammalian host used, the platform had prospectively scored its own design higher (composite 68.7 vs 59.0 for the scientist-designed construct), and its predictions of yield, solubility, and disorder matched the wet-lab outcome. This is a single, deliberately scoped case study, not a population-level benchmark; the two constructs differ in more than one feature, and biological activity was not assayed. Notwithstanding the bottlenecks of this approach discussed here, prediction-guided construct and protocol design, used as a decision aid, has the potential to remove costly iteration cycles from protein production campaigns.

11

Design to Data for Mutant of β-Glucosidase B from Paenibacillus polymyxa: G23S

O'Donnell, A.; Abbas, G.

2026-04-30 biochemistry 10.64898/2026.04.27.721118 medRxiv

Top 0.1%

1.7%

β-glucosidase (BglB) from Paenibacillus polymyxa was mutated (G23S, Rosetta/Foldit numbering; G26S, conventional numbering) to assess structural and functional changes. Foldit modeling and prior Design 2 Data (D2D) database results led us to hypothesize that this mutation would increase substrate binding affinity and catalytic efficiency, with a moderate reduction in thermal stability. The mutant protein was expressed, purified, and analyzed using kinetics and thermal stability assays. Relative to the wild-type (WT), G23S exhibited a similar binding affinity (similar Km), an approximately 2-fold increase in turnover number (kcat) and catalytic efficiency (kcat/Km), an almost 14-fold increase in maximum reaction velocity (Vmax) and a slight decrease in thermostability (T50). The results largely support the hypothesis, indicating that changes in residue 23 can enhance catalytic power while minimally compromising stability.

12

Staged heavy-chain filtering enables Fab discovery from combinatorially intractable library spaces

Kim, Y.; Kwon, H.; Hong, J.; Kang, C. K.; Park, W. B.; Kim, H.-R.; Lee, C.-H.

2026-05-13 bioengineering 10.64898/2026.05.10.724059 medRxiv

Top 0.1%

1.7%

Background: Combinatorial fragment antigen-binding (Fab) libraries encode an immense heavy-light chain pairing space, often exceeding 10{superscript 1} possible combinations, which far surpasses the diversity that can be experimentally constructed and screened in display systems. As a result, direct Fab screening samples only a small fraction of the theoretical search space, creating a practical bottleneck for functional binder discovery. Results: Here, we frame Fab discovery as a staged search problem by decoupling heavy-chain (HC) and light-chain (LC) exploration. We implemented a sequential HC preselection-remating workflow in yeast surface display, in which antigen-reactive HC variants are first enriched and subsequently recombined with a diverse LC repertoire to reconstruct a focused Fab library. In a SARS-CoV-2 spike-targeted campaign, HC and LC libraries of 2.05 x 10 and 2.33 x 10 members corresponded to a theoretical pairing space of approximately 4.8 x 10{superscript 1} combinations. Sequential HC enrichment followed by LC remating allowed recovery of multiple functional Fab clones from a tractable library scale of approximately 10, including clones that shared a common HC scaffold but carried distinct LC partners. A representative recombinant IgG output showed broad but heterogeneous spike/RBD binding, measurable pseudovirus neutralization activity (EC = 11.1 nM), and compatibility with standard early biophysical characterization after full-length IgG reformatting. Conclusions: These results provide proof of principle that combinatorial Fab discovery can be approached as a staged exploration problem under realistic library-size constraints. By focusing downstream Fab reconstruction on an antigen-compatible HC subspace, sequential HC preselection followed by LC remating offers a practical strategy for exploring otherwise intractable antibody pairing landscapes in eukaryotic display systems.

13

Structure-function studies of HRIKD-ΔKI, a Minimal Kinase Domain of Human Heme-Regulated Inhibitor Kinase

Rajasekaran, M. B.; Booth, J.; Crepin, D. F.; Roe, S. M.; Zhou, L.; Gianga, T.-M.; Siligardi, G.; Gonzalez-Mendez, R.; Staikopoulou, M.; Hassan, H.; Oliver, A.; Mancini, E.; Spencer, J.

2026-07-07 biochemistry 10.64898/2026.07.06.735516 medRxiv

Top 0.1%

1.7%

eIF2α kinase heme-regulated inhibitor (HRI) is a novel target for haematological malignancies, with modulators reported to trigger cell death via the HRI-eIF2α-ATF4 pathway. We report a protocol for producing the minimal kinase domain of full-length human HRI, termed HRIKD-ΔKI, where the unstructured 140 amino acid (aa) kinase insert (KI) within the HRI kinase domain (HRIKD) is replaced with a 2 aa glycine/serine (GS) linker. X-ray crystal structures were determined of apo-HRIKD-ΔKI and of its complex with ATP at 2.1 and 2.5 Å resolution, respectively. Both structures display a canonical bi-lobal kinase fold. However, they remain in a non-productive state with a displaced C-helix, disassembled R-spine, and a disordered activation segment hindering the substrate site. Biophysical assays (fluorescence-based thermal shift and Synchrotron Radiation Circular Dichroism) demonstrate that HRIKD-ΔKI retains its functional ligand-binding conformation. Altogether, these findings define structural and ligand-binding features of HRI to support ongoing drug discovery efforts in blood cancer.

14

Integrating Diffusion and Liquid AI Models for Predicting Peptide Affinity from mRNA Display Selections

Leaf, C. M.; Qi, P.; Gandhi, Y. P.; Jalali-Yazdi, F.; Ong, J. N.; Takahashi, T. T.; Kalia, R.; Roberts, R. W.

2026-05-11 bioengineering 10.64898/2026.05.05.723033 medRxiv

Top 0.1%

1.7%

In vitro selection and directed evolution technologies such as mRNA display explore large libraries (≥10^14 variants) and generate thousands to millions of functional polypeptide ligands to a variety of targets. Denoising diffusion implicit machine learning models (DDIMs) trained using display-derived deep sequencing data can greatly expand these functional sequences beyond what is accessible experimentally. However, methods are needed to predict peptide properties such as binding free energies (ΔG°). Here, we applied machine learning methods to predict binding free energies of both experimental and DDIM-generated peptide ligands against a target of interest, the oncogenic protein Bcl-xL. To do this, we trained a Closed-form Continuous (CfC) neural network using a dataset of 15,700 peptide ligands where pairs of sequences and their corresponding binding free energies (ΔG°) were used as inputs. This type of model was chosen due to its ability to represent irregular series. The resulting CfC model accurately predicts the rank order, within error, and binding free energies (ΔG°) for both experimental and DDIM-generated peptides, identifying five DDIM-generated peptides with single-digit picomolar affinities. Combining trained DDIM and CfC models offers a unified route to expand the scope of experimental ligand discovery and to predict the molecular properties of both experimental and generated ligands, and it highlights the utility of large quantitative datasets for making accurate in silico predictions of high-affinity peptide candidates. Statement: High-throughput sequencing analysis of mRNA display libraries enables generating novel peptide ligands and expands the scope of functional sequences beyond what is accessible experimentally. Closed-form Continuous neural networks trained using sequences and their corresponding free energies accurately predict the binding free energies of both experimental and machine learning-generated peptides, enabling a route to quantitatively predict peptide properties using directed evolution data.
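
For readers relating the binding free energies above to the reported picomolar affinities, the standard thermodynamic relation ΔG° = RT ln(Kd / 1 M) can be evaluated directly. The sketch below is a generic conversion, not the preprint's method; the temperature (298.15 K) and the kcal/mol units are assumptions, since the abstract states neither.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # molar gas constant in kcal/(mol*K)

def delta_g_from_kd(kd_molar, temperature_k=298.15):
    """Standard binding free energy (kcal/mol) from a dissociation constant.

    Uses dG = R * T * ln(Kd / 1 M); tighter binding gives a more negative value.
    The default temperature is an assumption, not a value from the preprint.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_molar)

dg_1_pm = delta_g_from_kd(1e-12)    # 1 pM
dg_10_pm = delta_g_from_kd(1e-11)   # 10 pM
per_decade = dg_10_pm - dg_1_pm     # cost of a tenfold loss in affinity
print(round(dg_1_pm, 2), round(per_decade, 2))
```

At this assumed temperature each tenfold change in Kd shifts ΔG° by about 1.36 kcal/mol, so distinguishing single-digit picomolar binders from weaker ones requires free-energy predictions accurate to roughly that scale.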

15

A Critical Sites-Driven and Light-weighted Protein Engineering Platform

Deng, Q.; Qiao, J.; Wang, C.; Ni, X.; Chang, Y.; Zhao, N.; Zhai, R.; Cui, H.; Li, X.; Jin, M.

2026-04-28 bioengineering 10.64898/2026.04.24.720551 medRxiv

Top 0.1%

1.5%

Protein language models (PLMs) provide a novel computational paradigm for deeply mining evolutionary information. Nevertheless, the discrepancy between the natural evolutionary fitness captured by their zero-shot predictions and actual industrial demands significantly constrains wet-lab success rates. To address this bottleneck, we developed CASPE, a light-weighted protein engineering platform consisting of the CAS and APCNet. CAS leverages gradient activation mapping and multi-layer attention matrices to transform the implicit representations of PLMs into explicit site-importance metrics. Working in tandem with APCNet, CASPE establishes a workflow encompassing the entire trajectory from site localization to residue prediction, which successfully overcomes the fitness misalignment issue, enabling the precise directed evolution of target protein properties. CASPE efficiently identifies thermostable (31-60%) and pH-stable (40-80%) mutants. Specialized models further boost its success in phytase evolution, significantly outperforming FoldX and ESM2-t33 in hit rates. By shifting from global saturated mutagenesis to targeted optimization of feature-relevant sites, CASPE streamlines enzyme evolution, yielding a higher discovery rate of beneficial mutants.

16

Deep learning based design of buried hydrogen bond networks with HBDesigner

Dieckhaus, H.; Harvey, B. T.; Mulikova, T.; Horenstein, J. T.; Nicely, N. I.; Randolph, N. Z.; Kuhlman, B.

2026-06-11 bioengineering 10.64898/2026.06.08.730848 medRxiv

Top 0.1%

1.5%

Accurate design of hydrogen-bonding (H-bonding) interactions is a longstanding goal in protein design, as they can facilitate specific protein-protein interactions while improving the solubility of the proteins in the unbound state. Despite this, computational design of H-bond networks remains underexplored in the deep learning era. Here, we present HBDesigner, a novel algorithm for H-bond network design. Through a combination of deep learning-based sampling and atomistic energy scoring, HBDesigner outperforms existing tools in designing connected H-bond networks onto protein scaffolds. We demonstrate the usefulness of HBDesigner by creating monomeric proteins with buried polar interactions and homodimers with extended interface H-bond networks, and by installing specificity into a family of homologous heterodimers where prior design tools fail to do so. The ability to design H-bond networks into arbitrary protein scaffolds should be broadly useful for a wide range of design applications.

17

A two-step selection method for in vitro evolution of translational proteins

Sakurai, A.; Shoji, K.; Ichihashi, N.

2026-05-10 synthetic biology 10.64898/2026.05.09.724044 medRxiv

Top 0.1%

1.1%

Improving the reconstituted translation system is a key requirement for bottom-up synthetic biology. Here, we developed a two-step in vitro evolutionary method that can be used for improving translational proteins. In this method, two distinct conditions were sequentially applied while maintaining genotype-phenotype linkage in water-in-oil droplets. Using this method, we performed in vitro evolution of four translation factors, IleRS, PheRS, EF-G, and EF-Tu, and identified mutations that modestly enhanced translation activity in in vitro expression assays. One of the EF-G mutations (P610S) increased activity per protein approximately 2-fold for the recombinant protein purified from E. coli. This selection method is useful for improving translational proteins for bottom-up synthetic biology.

18

Structure-guided computational design and mechanistic understanding of the p95HER2-targeting NAZ-mAb antibody and its variants

Rawat, P.; Kyte, J. A.; Greiff, V.; Dorraji, E.

2026-07-11 bioinformatics 10.64898/2026.07.07.736817 medRxiv

Top 0.1%

1.1%


Human epidermal growth factor receptor 2 (HER2) is an oncogenic receptor tyrosine kinase in breast cancer and other malignancies. A subset of HER2-positive tumours expresses 611-CTF-p95HER2, a tumour-specific, hyperactive truncated isoform associated with metastasis and treatment resistance that lacks most of the extracellular domain targeted by conventional HER2-directed antibodies. We previously developed NAZ-mAb (formerly known as Oslo-2), a monoclonal antibody against 611-CTF-p95HER2. Here, we describe a computational antibody-engineering workflow for designing variants of NAZ-mAb. Starting from the sequence alone, we modeled the NAZ-mAb-611-CTF-p95HER2 complex, generated a combinatorial mutational landscape using FoldX 5.0, and prioritized candidate variants using predicted interaction energy and developability criteria. Two variants representing distinct design strategies were selected for validation: an aromatic double mutant, NAZ-mAb v1 (L:S31W/L:H107W), and a conservative single mutant, NAZ-mAb v2 (L:S31M). Both variants were successfully expressed as recombinant IgGs; NAZ-mAb v2 achieved a five-fold higher recombinant expression yield than parental NAZ-mAb, while both variants retained antigen binding with a higher apparent signal than the parental antibody in indirect ELISA. However, Biacore two-state kinetic analysis revealed weaker affinities than the parental antibody (KD NAZ-mAb v1: 32.6 nM, NAZ-mAb v2: 9.45 nM vs. parental NAZ-mAb: 5.33 nM). These findings show that the computational workflow can generate experimentally tractable, antigen-engaging NAZ-mAb variants, while also highlighting the limitations of fixed-backbone interaction-energy ranking as a predictor of binding affinity and yield. This study provides a practical framework for computationally driven, developability-aware antibody optimization in the absence of experimental structural data.
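To put the reported KD values on a common scale, the short sketch below converts them into fold changes relative to the parental antibody and into an approximate change in binding free energy, using the standard relation ΔΔG = RT ln(KD,variant / KD,parental). The temperature (298.15 K) is an assumption, since the abstract does not state it, and the KD values come from a two-state kinetic fit, so the results are best read as apparent, order-of-magnitude estimates.

```python
import math

# Gas constant in kcal/(mol*K); the temperature is an assumption (not given in the abstract).
R_KCAL_PER_MOL_K = 1.987204e-3
ASSUMED_TEMPERATURE_K = 298.15

# KD values in nM, as reported in the abstract.
KD_NM = {"parental": 5.33, "v1": 32.6, "v2": 9.45}


def fold_change(kd_variant: float, kd_parental: float) -> float:
    """Ratio of variant KD to parental KD (>1 means weaker binding)."""
    return kd_variant / kd_parental


def ddg_kcal_per_mol(kd_variant: float, kd_parental: float,
                     temperature_k: float = ASSUMED_TEMPERATURE_K) -> float:
    """Approximate change in binding free energy; positive means destabilized."""
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_variant / kd_parental)


for name in ("v1", "v2"):
    fc = fold_change(KD_NM[name], KD_NM["parental"])
    ddg = ddg_kcal_per_mol(KD_NM[name], KD_NM["parental"])
    print(f"NAZ-mAb {name}: {fc:.2f}-fold weaker, ddG ~ {ddg:.2f} kcal/mol")
```

Under these assumptions, v1 binds roughly 6-fold more weakly and v2 a little under 2-fold more weakly than the parental antibody, which corresponds to about 1 kcal/mol or less in either case.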

19

Zero-shot design of a de novo metalloenzyme

El Nesr, G.; Duerr, S. L.; Mathews, I. I.; Wen, Q.; Zhao, K.; Sarangi, R.; Roethlisberger, U.; Sunden, F.; Huang, P.

2026-04-24 biochemistry 10.64898/2026.04.23.720277 medRxiv

Top 0.1%

1.1%


The de novo design of enzymes remains a central challenge, requiring consideration of catalytic mechanism and optimization across biochemical and biophysical criteria. To capture these criteria, we draw on principles from evolutionary biology. Here, we present dEVA (design by EVolutionary Algorithm), a multi-objective design framework for structure-based protein design. We apply dEVA to the zero-shot, de novo design of metalloenzymes by optimizing for the coordination sphere of catalytic metals. We fully characterize one of these designs: a bi-zinc metalloenzyme exhibiting promiscuous hydrolytic activity towards both phosphomonoesters and phosphodiesters. This design achieves a catalytic efficiency (kcat/KM) of up to 1500 M⁻¹ s⁻¹ and a rate enhancement ((kcat/KM)/kw) of up to 3 × 10¹³, comparable to characterized natural phosphatases. dEVA offers a general and modular strategy for the programmable design of protein function without dependence on natural templates, predefined motifs, or evolutionary information.

20

SPAFESTWDILK, a plant-derived dodecapeptide from Zingiber officinale, as a predicted inhibitor of the MDM2-p53 interaction: computational discovery and multi-method evaluation

Ashtiani, M.; Romiti, M.; Sandri, C.; Paiola, G.

2026-06-10 biochemistry 10.64898/2026.06.06.730565 medRxiv

Top 0.2%

1.0%


The MDM2-p53 protein-protein interaction is a validated oncology target, yet no food-derived linear peptide has been documented to engage the canonical three-anchor MDM2-p53 interface. We developed a multi-stage computational pipeline (PepVeg) to screen 22 plant and fungal proteomes (337,646 proteins) for MDM2-binding peptides, applying sequential in silico hydrolysis, physicochemical filtering, ESM-2 embedding-based dimensionality reduction, and pharmacophore-driven selection. Twenty-six candidates were evaluated by AlphaFold 3 (AF3) co-folding against MDM2(25-109), yielding 15 predicted binders (iPTM >= 0.75; 58% of evaluated). A 36-peptide benchmark with 29 hard negatives confirmed AF3 discriminative power (Cohen's d = 3.41; 95% CI: 1.94-4.88; Hedges' g = 3.32; zero overlap). The lead candidate, SPAFESTWDILK, a tryptic fragment of Zingiber officinale histone deacetylase (UniProt A0A8J5FLH2), was evaluated by eight computational assessments: AF3 Server (iPTM 0.83, SD 0.01), Protenix (iPTM 0.923), Chai-1 (iPTM 0.891), EvoEF2 (-55.57 EEU), two GROMACS simulations (no dissociation across two force fields), and two MM-PBSA calculations (-75.30 (SD 4.92) and -55.07 (SD 2.86) kcal/mol). The W8A point mutant produced an iPTM drop of 0.201, closely paralleling the p53 W23A drop of 0.193; we predict that the W8A substitution will abolish binding. SPAFESTWDILK ranked only #890/2,000 by ESM-2 similarity and was recovered solely through pharmacophore matching, demonstrating that no single pipeline stage is sufficient on its own. To our knowledge, this is the first food-database-derived linear peptide with multi-convergent computational evidence supporting engagement of the canonical three-anchor MDM2-p53 interface. Experimental validation by SPR/ITC is warranted.
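The summary statistics quoted above can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the hit rate (15 of 26) and applies the commonly used small-sample approximation for converting Cohen's d to Hedges' g, g ≈ d · (1 − 3/(4·df − 1)) with df = N − 2. The choice of this particular correction is an assumption, since the abstract does not say which formula was used, so the result need not match the reported g exactly.

```python
def hit_rate_percent(hits: int, evaluated: int) -> float:
    """Fraction of evaluated candidates passing the threshold, as a percentage."""
    return 100.0 * hits / evaluated


def hedges_g_from_d(cohens_d: float, n_total: int) -> float:
    """Approximate small-sample correction of Cohen's d (two-group design).

    Uses J ~ 1 - 3 / (4 * df - 1) with df = n_total - 2. This is the common
    approximation and is assumed here, not taken from the abstract.
    """
    df = n_total - 2
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return cohens_d * correction


# Values as reported in the abstract.
print(f"hit rate: {hit_rate_percent(15, 26):.0f}%")
print(f"Hedges' g (approx.): {hedges_g_from_d(3.41, 36):.2f}")
```

With the reported inputs, the hit rate rounds to 58%, and the approximate correction gives a g close to, though not necessarily identical with, the reported 3.32.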